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Abstract 
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neural  networks  have  been  constructed  to  perform,  can  be  regarded  as  synthesizing  an 
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recognition  and  neural  network  algorithms.  In  this  note,  we  extend  the  theory  by 
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1  Introduction 


In  previous  papers  (Poggio  and  Girosi,  1989,  1990)  we  have  shown  the  equiv¬ 
alence  between  regularization  and  a  class  of  three-layer  networks  that  we 
called  regularization  networks  and  that  are  related  to  the  classical  interpola¬ 
tion  technique  of  Radial  Basis  Functions. 

Let  S  =  {(xt,  y,)  €  Rn  x  R\i  =  1 ,  ...Ar}  be  a  set  of  data  that  we  want  to  ap¬ 
proximate  by  means  of  a  function  /.  The  regularization  approach  (Tikhonov, 
1963;  Tikhonov  and  Arsenin,  1977;  Morozov,  1984;  Bertero,  1986)  selects  the 
function  /  that  solves  the  variational  problem  of  minimizing  the  functional 


«[/l=Ete-/(x,))J+A||P/||J  (1) 

t=l 

where  P  is  a  constraint  operator  (usually  a  differential  operator),  ||  •  ||  is 
a  norm  on  the  function  space  to  whom  Pf  belongs  (usually  the  L 2  norm) 
and  A  is  a  positive  real  number,  the  so  called  regularization  parameter.  The 
structure  of  the  operator  P,  that  is  called  “stabilizer”,  embodies  the  a  priori 
knowledge  about  the  solution,  and  therefore  depends  on  the  nature  of  the 
particular  problem  that  has  to  be  solved  (for  instance,  it  is  not  needed  in 
the  case  of  P  corresponding  to  a  Gaussian  or  bell-shaped  Green’s  function). 
We  have  shown  (Poggio  and  Girosi,  1989)  that  the  solution  of  the  variational 
problem  (1)  has  the  following  simple  form: 


N 

/(x)  =  £ciG(x;x<)  +  p(x) 

1=1 

where  G(x)  is  the  Green’s  function  (Stakgold,  1979)  of  the  self-adjoint  dif¬ 
ferential  operator  PP,  P  being  the  adjoint  operator  of  P,  p(x)  is  a  linear 
combination  of  functions  that  span  the  null  space  of  P,  and  the  coefficients 
Ci  satisfy  a  linear  system  of  equations  that  depend  on  the  N  “examples”,  i.e. 
the  data  to  be  approximated.  The  form  of  the  term  p(x)  depends  on  the  sta¬ 
bilizer  that  has  been  chosen  and  on  the  boundary  conditions,  and  therefore 
on  the  particular  problem  that  has  to  be  solved.  For  this  reason,  and  since 
its  inclusion  does  not  modify  the  main  conclusions,  we  will  disregard  it  in  the 
following.  If  P  is  an  operator  with  radial  symmetry,  the  Green’s  function  G 

is  radial  and  therefore  the  approximating  function  becomes:  — - 
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(2) 


/(x)  =  5>G(||x-Xi||a), 

1=1 

which  is  a  sum  of  radial  functions,  each  with  its  center  X;  on  a  distinct  data 
point.  Thus  the  number  of  radial  functions,  and  corresponding  centers,  is 
the  same  as  the  number  of  examples. 

In  this  note  we  indicate  how  to  extend  the  technique  into  three  natural 
directions: 

1.  The  computation  of  a  solution  of  the  form  (2)  has  a  complexity  (number 
of  radial  functions)  that  is  independent  of  the  dimensionality  of  the 
input  space  but  is  on  the  order  of  the  dimensionality  of  the  training 
set  (number  of  examples),  which  is  usually  high.  We  show  how  to 
justify  in  terms  of  the  regularizationan  framework  an  approximation 
of  equation  (2)  in  which  the  number  of  centers  is  much  smaller  than 
the  number  of  examples  and  the  positions  of  the  centers  are  modified 
during  learning  (Poggio  and  Girosi,  1989).  The  key  idea  is  to  consider 
a  specific  form  of  an  approximation  to  the  solution  of  the  standard 
regularization  problem.  Moving  centers  are  equivalent  to  the  free  knots 
of  nonlinear  splines.  In  the  context  of  networks  they  were  first  suggested 
as  a  potentially  useful  heuristics  by  Broomhead  and  Lowe  (1988)  and 
used  by  Moody  and  Darken  (1989). 

2.  It  is  natural  to  try  to  extend  the  form  of  the  solution  (2)  by  considering 
the  superposition  of  different  types  of  Green’s  functions  (Poggio  and 
Girosi,  1989,  1990a)  (for  example  basis  functions  of  different  scales). 
This  extension  is  natural  within  the  framework  of  regularization  (and 
has  a  direct  Bayesian  interpretation)  by  considering  a  more  general 
functional  than  equation  (1)  containing  several  stabilizers.  We  will 
show  how  the  well-defined  but  underconstrained  variational  problem 
associated  with  the  new  functional  can  be  transformed  into  an  over¬ 
constrained  problem. 

3.  In  equation  (2)  the  norm  j|x  —  X;||  may  be  considered  as  a  weighted 
norm 

!lx  -  x»llw  =  (x  ~  x,)rWrW(x  -  xt) 
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where  W  is  a  square  matrix  and  the  superscript  T  indicates  the  trans¬ 
pose.  In  the  simple  case  of  diagonal  W  the  diagonal  elements  wa  assign 
a  specific  weight  to  each  input  coordinate,  and  the  standard  Euclidean 
norm  is  obtained  when  W  is  set  to  the  identity  matrix.  They  play 
a  critical  role  whenever  different  types  of  inputs  are  present.  We  will 
show  how  the  weighted  norm  idea  can  be  derived  from  a  slightly  more 
general  functional  than  equation  (1).  The  associated  variational  prob¬ 
lem  is  well-defined  but  underconstrained;  it  can  be  transformed  into  an 
overconstrained  problem  by  using  a  certain  approximation  technique. 

We  call  Hyper  Basis  Functions ,  in  short  HyperBFs,  the  most  general  form 
of  regularization  networks  based  on  these  three  extensions. 

2  Moving  Centers 

The  solution  given  by  standard  regularization  theory  to  the  approximation 
problem  can  be  very  expensive  in  computational  terms  when  the  number  of 
examples  is  very  high.  The  computation  of  the  coefficients  of  the  expansion 
can  become  then  a  very  time  consuming  operation:  its  complexity  grows 
polynomially  with  N,  (roughly  as  N3)  since  an  N  x  N  matrix  has  to  be  in¬ 
verted.  In  addition,  the  probability  of  ill-conditioning  is  higher  for  larger  and 
larger  matrices  (it  grows  like  N3  for  a  N  x  N  uniformly  distributed  random 
matrix)  (Demmel,  1987).  We  now  show  a  way  to  reduce  the  complexity  of  the 
problem,  introducing  an  approximation  to  the  regularized  solution.  While 
the  exact  regu'arization  solution  is  equivalent  to  generalized  splines  with  fixed 
knots,  the  approximated  solution  is  equivalent  to  generalized  splines  with  free 
knots. 


2.1  An  approximation  to  the  regularization  solution 

A  standard  technique,  sometimes  known  as  Galerkin’s  method,  that  has  been 
used  to  find  approximate  solutions  of  variational  problems,  is  to  expand  the 
solution  on  a  finite  basis.  The  approximated  solution  /*(x)  has  then  the 
following  form: 


/'(*)  =  Hc,<£,(x)  (3) 

i=l 
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where  {^i}"=1  is  a  set  of  linearly  independent  functions  (Mikhlin,  1965).  The 
coefficients  c,  are  usually  found  according  to  some  rule  that  guarantees  a 
minimum  deviation  from  the  true  solution.  In  the  case  of  standard  regu¬ 
larization,  when  the  functional  to  minimize  is  given  by  equation  (1),  this 
method  gives  the  exact  solution  if  n  is  equal  to  the  numer  of  data  points  N, 
and  f<Mr=i  =  {G(x;x<)}iln  where  G  is  the  Green’s  function  of  the  opera¬ 
tor  PP.  In  this  case  the  unknown  coefficients  of  the  expansion  (3)  can  be 
obtained  in  a  simple  way  by  substituting  expansion  (3)  in  the  regularization 
functional  (1),  that  becomes  a  function  //[/’]  =  H’(cy, . . . ,  c^),  and  then  by 
minimizing  H[fm }  with  respect  to  the  coefficients,  that  is  by  setting: 

=  0  i  =  1, . . . ,  iV.  (4) 

It  can  be  easily  shown  (Poggio  and  Girosi,  1989)  that,  if  the  Green’s 
function  vanishes  on  the  boundary  of  the  region  that  is  considered,  the  set  of 
equations  (4)  is  a  linear  system  whose  solution  gives  the  standard  regulariza¬ 
tion  coefficients.  In  more  general  cases  the  basis  {<£,}"=1  should  be  enlarged, 
to  include  terms  that  generate  the  null  space  of  P,  in  order  to  obtain  the 
correct  solution.  For  simplicity,  we  disregard  these  terms  in  *he  following, 
since  they  do  not  change  the  main  conclusions.  A  natural  approximation  to 
the  exact  solution  will  be  then  of  the  form: 

/-(X)  =  £>„G(x;M  (5) 

a=l 

where  the  parameters  tQ,  that  we  call  “centers”,  and  the  coefficients  cQ  are 
unknown,  and  are  in  general  fewer  than  the  data  points  ( n  <  N).  This  form 
of  solution  has  the  desirable  property  of  being  an  universal  approximator  for 
continuous  functions  (Girosi  and  Poggio,  1989)  and  to  be  the  only  choice 
that  guarantees  that  in  the  case  of  n  =  N  and  {tQ}£=1  =  {x,}”-!  the  correct 
solution  (of  equation  1)  is  consistently  recovered.  We  will  see  later  in  section 
(5)  how  to  find  the  unknown  parameters  of  this  expansion. 

3  Different  types  of  Basis  Functions. 

This  scheme  can  be  further  extended  by  considering  in  equation  (5)  the 
superposition  of  different  types  of  functions  G,  such  as  Gaussians  at  different 


dH[f] 
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scales. 

The  function  /  to  be  approximated  is  regarded  as  the  sum  of  p  compo¬ 
nents  fm,  m  =  1  , ...,p,  each  component  having  a  different  prior  probabil¬ 
ity.  This  assumption  is  clearly  meaningful  only  if  p  «  N.  Therefore  the 
functional  H[f]  to  minimize  will  contain  p  stabilizers  Pm,  p  regularization 
parameters  Am  and  will  be  written  as 

m  =  E(*  -  E  /"(*.))’  +  E  Am||P”m2  .  (6) 

i=l  m=l  m=l 

The  Euler- Lagrange  equations  associated  with  equation  (6)  have  the  form: 

PmPm  =  m  =  l . p.  (7) 

Am  .=1  k=l 

As  in  the  case  of  standard  regularization,  the  solution  of  equation  (7)  is 
a  linear  superposition  of  Green’s  functions: 

rw  =  E«rc"(*;*i).  (8) 

i=i 

The  function  F(x)  that  minimizes  the  functional  H[f  \  is  then  a  linear  su¬ 
perposition  of  linear  superpositions  of  the  Green’s  functions  Gm  correspond¬ 
ing  to  the  stabilizers  Pm,  that  is 

f(*)=  +Kx),  (9) 

m=l  1=1 

where  p(x)  is  a  linear  combination  of  functions  that  span  the  null  spaces 
of  the  stabilizers.  For  instance,  when  Gm(x)  are  Gaussian  a  polynomial  is 
not  needed,  though  it  can  always  be  added.  For  other  Green’s  functions  the 
theory  requires  an  appropriate  p(x). 

Substitution  of  equation  (8)  in  equation  (7)  yields  a  linear  system  for  the 
coefficients  c™.  There  is  a  simple  relation  between  the  coefficients  associated 
to  two  different  stabilizers,  that  is 

c^Am=c"An,  i  =  l,...,fV;  n,m  =  l,...,p. 
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This  means  that  if  a  component  /m(x)  of  the  solution  is  given,  the  other 
p  —  1  ones  can  be  recovered  by  a  simple  scaling  operation.  This  is  expected, 
since  the  underlying  variational  problem  is  underconstrained:  we  are  trying 
to  obtain  Np  coefficients  from  a  set  of  N  data  points.  The  form  of  the 
solution  (9)  is  appealing:  if  all  the  coefficients  c™  were  independent  and  free 
to  vary ,  the  system  could  “choose”  among  different  stabilizers,  depending  on 
the  site.  In  order  to  retain  the  form  (9)  of  the  solution,  while  making  the 
problem  overconstrained  instead  of  underconstrained,  we  choose  a  solution 
of  the  approximation  problem  of  the  following  form  (instead  of  equation  9): 

f'M=t/mM+pW.  (io) 

m  =  l 
A  m 

fm(x)  =  'E,  c"G"(x;C)  (ii) 

a=i 

where  (1  -f  d)  I\m  <  N  and  the  coefficients  c™  and  the  centers  t™  are 

unknowns.  They  can  be  found  with  a  technique  similar  to  the  one  described 
in  section  (5).  Notice  that  equations  (10)  and  (11)  are  of  the  same  form  as 
equation  (5)  and  share  its  approximation  properties. 

3.1  Multiple  Scales. 

This  method  leads  in  particular  to  radial  basis  functions  of  multiple  scales 
for  the  reconstruction  of  the  function  /.  Suppose  we  know  a  priori  that 
the  function  to  be  approximated  has  components  on  a  number  p  of  scales 
<7 1, . . . ,  <rp:  we  can  use  this  information  to  choose  a  set  of  p  stabilizers  whose 
Green’s  functions  are,  for  example,  Gaussians  of  variance  ct\,.  . .  ,av.  We  have 
(Poggio  and  Girosi,  1989,  1990a)  : 

°o  , 

\\Pmrf  =  t,a”  dx(Dkf(x))2 

Si  Jr" 

where  D2k  =  V2fc,  D2k+l  =  VV2;c  and  a™  =  |j“t,  V  being  the  gradient 
operator.  As  a  result,  the  solution  will  be  a  superposition  of  superpositions 
of  Gaussians  of  different  variances.  Of  course,  the  Gaussians  with  large  a 
should  be  preset,  depending  on  the  nature  of  the  problem,  to  be  fewer  and 
therefore  on  a  sparser  grid,  than  the  Gaussians  with  a  small  a. 
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The  HyperBF  method  also  yields  non-radial  Green’s  functions  -  by  using 
appropriate  stabilizers  -  and  also  Green’s  functions  with  a  lower  dimension¬ 
ality  -  by  using  the  associated  fm  and  Pm  in  a  suitable  lower- dimensional 
subspace.  Again  this  reflects  a  priori  information  that  may  be  available  about 
the  nature  of  the  mapping  to  be  learned.  In  the  latter  case  the  information 
is  that  the  mapping  is  of  lower  dimensionality  or  has  lower  dimensional  com¬ 
ponents. 

4  Weighted  norm 

The  norm  in  equation  (5)  is  usually  intended  as  an  Euclidean  norm.  If  the 
components  of  x  are  of  different  types,  it  is  natural  to  consider  a  weighted 
norm  defined  as 


l|x||W  =  xrWTWx, 

since  the  relative  scale  of  the  components  is  otherwise  arbitrary.  The  case 
in  which  the  matrix  W  is  known  (from  prior  information)  does  not  present 
any  difficulty.  It  is  interesting,  however,  to  see  what  it  means  in  terms  of  the 
underlying  regularization  principle. 

4.1  Weighted  norm  and  regularization 

The  regularization  principle  consists  in  finding  the  /  that  minimizes  the 
functional: 


ffw[/]  =  By.-/(*))2  +  A!lp/llw  (12) 

1=1 

where  we  assume  that  P  is  radially  symmetric  in  the  variable  y  and  that 
y  =  Wx  (i.e.  y  is  a  known  linear  transformation  of  x  that  depends  on  the 
parameters  W).  This  means  that  the  smoothness  constraint  is  given  in  a 
space  that  is  an  affine  transformation  of  the  original  x  space.  The  Green’s 
function  associated  with  equation  (12)  is 

G(||y||2)  =  G(|!x||W)  (13) 

with  ||x|j\v  =  xTWTWx. 
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Suppose  now  that  the  parameters  W  are  unknown.  We  can  formulate  the 
problem  of  finding  /  and  W  that  minimize  the  functional  Notice 

that  the  relevant  quantity  is  M  =  WrW,  since  W  only  appears  in  this 
form.  The  matrix  M  is  symmetric  and  positive  definite;  it  has  therefore  a 
unique,  symmetric  “square  root”  R,  such  that  M  =  RrR  =  R2.  One  could 
chose  to  identify  W  with  R.  W  would  be  therefore  symmetric,  with 
independent  parameters. 

Thus  finding  the  optimal  W  corresponds  to  finding  the  best  stabilizer 
among  those  that  are  expressed  in  a  coordinate  system  which  is  a  linear 
transformation  of  the  original  one.  The  parameters  W  of  the  linear  trans¬ 
formation  become  parameters  of  H  with  respect  to  which  the  functional  is 
minimized. 

The  simplest  case  is  the  case  of  W  diagonal  and  G(x)  =  e~x2 .  In  this 
case 


G(||x|fo)  =  e~T*w'  e~z*w2 


and  thus  the  components  u>,  of  W  are  equivalent  to  the  inverse  of  the  variance 
rr  of  each  component  of  the  multidimensional  Gaussian. 

In  the  probabilistic  interpretation  of  standard  regularization  (see  Poggio 
and  Girosi,  1989)  the  term  A||P/||2  in  the  regularization  functional  corre¬ 
sponds  to  the  following  prior  probability  in  a  Bayesian  formulation  in  which 
the  MAP  (Maximum  A  Posteriori)  estimate  is  sought: 


Prob(f)  =  e~x"p'"\ 

Our  extension  corresponds  to  choosing  the  stabilizer  Pw  =  [|P/(y)||2, 
with  y  =  Wx.  The  stabilizer  P*w  is  parametrized  by  the  matrix  W  and 
defines  a  prior  Prob\v(f)  which  is  also  parametrized  by  W. 

The  solution  of  the  variational  problem  (12)  has  the  form 

/(x)  =  £ct-G(||x-x,||^),  (14) 

i=i 

where  the  coefficients  c,  and  the  elements  of  the  matrix  W  must  be  estimated. 
Here  again  we  are  facing  an  underconstrained  variational  problem,  since  we 
trying  to  determine  N  +  parameters  from  N  data  points.  The  same 

considerations  of  section  (G )  apply:  in  order  to  transform  the  problem  into 
an  overconstrained  problem,  we  look  for  a  solution  of  the  form 
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(15) 


/■<*)- £*0(1* -MW) 

0  =  1 

5  How  to  learn  centers’  positions  and  norm 
weights 

Suppose  that  we  look  for  an  approximated  solution  of  the  regularization  prob¬ 
lem  of  the  form  (15).  We  now  have  the  problem  of  finding  the  n  coefficients 
ca,  the  d  x  n  coordinates  of  the  centers  ta  and  the  T-  elements  of  the 
matrix  M  so  that  the  expansion  (12)  is  optimal.  To  avoid  too  many  indeces, 
we  will  only  consider  here  the  case  p  =  1  in  eq.  10.  The  extension  is  obvious. 
In  this  case  we  can  use  the  natural  definition  of  optimality  given  by  the  func¬ 
tional  H .  We  then  impose  the  condition  that  the  set  {cQ,  tQ|a  =  1, ...,  n}  and 
'.he  matrix  M  must  be  such  that  they  minimizes  //[/*],  and  the  following 
equations  must  be  satisfied: 


BH{f\  „  OH[f\  3H[f]  „ 

dca  °’  <9t„  _°’  5M  ”  °  ~  1 . " 

Gradient-descent  is  probably  the  simplest  approach  for  attempting  to 
find  the  solution  to  this  problem,  though,  of  course,  it  is  not  guaranteed 
to  converge.  Several  other  iterative  methods,  such  as  versions  of  conjugate 
gradient  and  simulated  annealing  (Kirkpatrick  et  ah,  1983)  may  be  more 
efficient  than  gradient  descent  and  should  be  used  in  practice.  Since  the 
function  //[/*]  to  minimize  is  in  general  non-convex,  a  stochastic  term  in  the 
gradient  descent  equations  may  be  advisable  to  avoid  local  minima.  In  the 
stochastic  gradient  descent  method  the  values  of  ca,  ta  and  M  that  minimize 
//[/*]  are  regarded  as  the  coordinates  of  the  stable  fixed  point  of  the  following 
stochastic  dynamical  system: 

Cct  —  UJ  "f*  TJot  (0?  ^ 

C/Cfy 

dH\n 

t "  +  q  =  1 - n 
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where  T]a(t),  na{t)  and  fl  (f)  are  white  noise  of  zero  mean  and  u;  is  a  parameter 
determining  the  microscopic  timescale  of  the  problem  and  is  related  to  the 
rate  of  convergence  to  the  fixed  point.  Defining 

A,  =  yi  -  /*(x)  =  y,  -  ]T  caG(||xi  -  tjvv) 

0  =  1 

and  setting  A  =  0  for  simplicity  (the  more  general  case  can  be  approached  in 
a  similar  way)  in  equation  (1)  we  obtain 


HID  =  HcXM  =  Z(A,)2. 

t=i 

The  important  quantities  -  that  can  be  used  in  more  efficient  schemes 
than  gradient  descent  -  are,  with  ||x,  —  tajj^v  =  (x,  -  ta)rM(x,  -  tQ)  and 
M  =  WTW: 

•  for  the  ca 


^fQ  =  -2i;A,G(||x,-t0|lW)  ;  (16) 

OCa  ,=1 

•  for  the  centers  ta 

=  4c°  Y  A,G'(||x,  -  ta[|w)M(x,  -  tQ)  (17) 

•  and  for  M 


Y  c°  Y  A.G'(||x:  -  t0||vv)Q.,o  (18) 

o=l  i=l 

where  Q,_a  =  (x,  —  tc.)(x,  -  t0)r  is  a  dyadic  product  and  G'  is  the  first 
derivative  of  G. 

Remarks 
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1.  Instead  of  equation  (IS)  for  M  the  following  equation  can  be  used  for 

W: 

dH\fm]  n  N 

=  — 4W  £  Y.  A,G'(l|x,  -  (19) 

0=1  1=1 

2.  From  equation  (18)  the  matrix  M  is  guaranteed  to  remain  symmetric  in 

a  deterministic  gradient  descent  scheme,  since  the  right  hand-side  of  the 
equation  is  symmetric  (because  the  Qi<a  are  correlation  matrices  and  a 
linear  combination  of  symmetric  matrices  is  symmetric).  Of  course,  the 
initial  value  mu^t  be  a  symmetric  matrix  and  in  the  stochastic  update 
scheme,  the  noise  term  must  not  break  the  symmetry.  The  matrix 
M  must  satisfy  ditional  constraint  of  remaining  positive  definite 

(since  the  scalar  product  xrMx  must  be  non-negr  live).  We  conjecture 
that  equations  (16),  (17)  and  (18)  conserve  the  positive  definiteness  of 
M  if  G  is  positive  definite. 


3. 


Equation  (16)  has  a  simple  interpretation:  the  correction  is  equal  to 
the  sum  over  the  examples  of  the  products  between  the  error  on  that 
example  and  the  “activity”  of  the  “unit”  that  represents  with  its  center 
that  example.  Notice  that  H[f‘\  is  quadratic  in  the  coefficients  ca,  and 
if  the  centers  and  the  matrix  M  are  kept  fixed,  it  can  be  shown  (Poggio 
and  Girosi,  19S9)  that  the  optimal  coefficients  are  given  by 


c  =  (GT  G+\g)-lGTy  (20) 

where  we  have  defined  (y);  =  t/,,  (c)Q  =  cQ,  (G)l£1  =  G(Xi;t0)  and 
(g)ap  —  G(ta;  t#).  If  A  is  let  go  to  zero,  the  matrix  on  the  right  side  of 
equation  (20)  converges  to  the  pseudoinverse  of  G  (Albert,  1972),  and  if 
the  Green’s  function  is  radial  the  approximation  method  of  Broomhead 
and  Lowe  (1988)  is  recovered. 

4.  Equation  (17)  is  similar  to  task-dependent  clustering  (Poggio  and  Girosi, 
1989).  This  can  be  best  seen  by  assuming  that  A,  are  constant:  then 
the  gradient  descent  updating  rule  makes  the  centers  move  as  a  func¬ 
tion  of  the  majority  of  the  data,  that  is  of  the  position  of  the  clusters. 
In  this  case  a  technique  similar  to  the  k-means  algorithm  is  recovered 
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(MacQueen,  1967;  Moody  and  Darken,  1989).  Fquating  ^  to  zero 
we  notice  that,  when  the  matrix  M  is  set  to  the  identity  matrix,  the 
optimal  centers  ta  satisfy  the  following  set  of  nonlinear  equations: 


t«  = 


EiPfxj 


a  =  1, . . . ,  n 


where  P?  =  A.G'dlx,-  — 10||2).  The  optimal  centers  are  then  a  weighted 
sum  of  the  data  points.  The  weight  P°  of  the  data  point  i  for  a  given 
center  ta  is  high  if  the  interpolation  error  A,  is  high  there  and  the  radial 
basis  function  centered  on  that  knot  changes  quickly  in  a  neighborhood 
of  the  data  point.  This  observation  suggests  faster  update  schemes,  in 
which  a  suboptimal  position  of  the  centers  is  first  found  and  then  the 
ca  are  determined,  similarly  to  the  algorithm  developed  and  tested 
successfully  by  Moody  and  Darken  (1989). 

5.  Equation  (19)  (by  assuming  that  £”=1  ^A.-G'dlx,  —  tall^y)  is  asymp¬ 
totically  constant  (!!))  contains  the  quantity  Qi.a  which  is  an 
estimate  of  the  correlation  matrix  of  all  the  examples  relative  to  ta 
(modulus  a  normalization  factor).  Let  us  define  Cm,a  as  the  d  x  m  ma¬ 
trix  whose  columns  are  the  vectors  of  the  examples  Xj  — 10, ...,  xm  —  tQ. 
Then  YliLi  Qi,a  can  be  written  as  YiiLi  Q i,a  =  Cs^taCjj  a  and  is  the  d  x  d 
correlation  matrix  (d  being  the  number  of  components  of  x).  Interest¬ 
ingly,  in  this  case,  equation  (19),  when  inserted  in  the  gradient  descent 
equation,  has  the  form: 


W  =  -WQ 

which  has  the  solution 


N 

W(t)  =  W(0)e-Q(  =  W(0)  £  e~Xjtejef 

:= i 

where  ej  are  the  eigenvectors  of  Q  and  Ay  are  the  associated  eigenvalues. 
All  eigenvectors  will  decay  to  0,  the  ones  with  the  largest  eigenvalues 
fastest.  Since  in  the  full  equation  the  other  terms  such  as  A,  will  keep 
W  from  decaying  to  0,  we  may  expect  that  W  will  converge  to  a  matrix 
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with  rows  that  are  similar  to  the  eigenvectors  of  Q  with  the  smallest 
eigenvalues.  In  other  words,  the  equation  should  converge  to  rows  of  W 
that  span  the  space  orthogonal  to  the  space  spanned  by  the  principal 
components  of  the  input  examples  (i.e.  the  eigenvectors  of  Q  with  the 
largest  eigenvalues).  In  this  case,  the  matrix  M  is  a  projection  operator 
that  projects  x  into  a  space  orthogonal  to  the  space  of  the  principal 
components.  The  principal  components  are  the  singular  vectors  of  X, 
with  the  property  that  they  span  a  nested  set  of  optimal  subspaces. 
This  interpretation  of  the  gradient  descent  equation  is  just  a  rough  in¬ 
dication  of  what  may  happen,  because  of  the  very  strong  underlying 
assumptions.  It  turns  out  that  in  the  object  recognition  case  (Poggio 
and  Edelmann,  1990),  the  interpretation  is  perfectly  consistent  with 
what  one  expects,  given  the  (linear)  computational  theory  underlying 
the  problem  (Basri  and  Ullmann,  1990;  see  also  the  appendix  in  Edel¬ 
mann  and  Poggio,  1990).  Under  orthographic  projection,  the  vectors 
representing  views  of  the  same  object  span  a  linear  subspace  with  a  low 
dimension.  Let  us  assume,  according  to  the  above  discussion,  that  W 
projects  a  new  input  vector  into  a  space  orthogonal  to  the  one  spanned 
by  the  principal  components  extracted  from  many  views  of  the  object 
(the  “examples”).  Then,  if  the  new  input  is  another  view  of  the  same 
object,  the  result  will  be  close  to  zero  for  all  units.  In  the  case  of  the 
Gaussian,  for  instance,  this  means  that  each  unit  will  be  maximally 
activated  and  by  suitable  choice  of  c  any  desired  output  may  be  syn¬ 
thesized.  On  the  other  hand,  if  the  new  input  is  the  view  of  a  different 
object,  the  result  of  operating  on  it  with  W  will  be  different  from 
zero  and  possibly  large  enough  to  give  a  very  small  activity  of  the  unit 
making  it  impossible  to  synthesize  a  desired  output  by  an  appropriate 
choice  of  the  c  (the  output  will  be  zero  or  close  to  it).  In  this  case, 
the  appropriate  W  will  solve  the  problem  with  just  one  center  (since 
the  problem  is  linear).  Notice  that  if  W  is  symmetric  (i.e.  if  W  is  the 
square  root  of  M),  it  has  the  same  eigenvectors  of  M,  and  M  and  W 
have  the  same  null  space. 

6.  One  may  think  intuitively  that  it  is  desirable  that  W  is  space  depen¬ 
dent,  that  is  W  =  W(x).  This  assumption,  however,  seems  rather 
meaningless  from  the  point  of  view  of  regularization  theory.  As  a  con¬ 
sequence,  we  believe  that  it  is  wrong  to  assume  W  =  W(x)  in  a  scheme 
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such  as  HyperBF.  On  the  other  hand,  it  makes  theoretically  sense  to 
use  different  HyperBF  networks  for  different  subsets  of  the  domain  of 
the  given  multivariate  function,  each  one  possibly  with  a  different  W. 
We  do  not  have  any  theory,  however,  of  how  to  partition  appropriately 
the  domain  of  the  function.  An  alternative  approach,  that  also  makes 
sense,  is  local  linear  approximation.  In  this  case  one  finds  a  set  of  local 
charts,  somewhat  similarly  to  computing  W(x). 

5.1  A  practical  algorithm 

It  seems  natural  to  try  to  find  a  reasonable  initial  value  for  the  parameters 
c,  tQ,  M,  to  start  the  minimization  process.  In  the  absence  of  more  specific 
prior  information  the  following  heuristics  seems  reasonable. 

•  Set  the  number  of  centers  and  set  the  centers’  positions  to  positions 
suggested  by  cluster  analysis  of  the  data  (or  more  simply  to  a  subset 
of  the  examples’  positions). 

•  Set  the  rows  of  W  to  be  vectors  orthogonal  to  the  eigenvectors  with 

largest  eigenvalues  of  Qi  a. 

•  Use  matrix  pseudo-inversion  to  find  the  ca. 

•  Use  the  ta,  M  =  WTW  and  cQ  found  so  far  as  initial  values  for  gradient 
descent  equations. 

It  should  be  noticed  than  an  even  more  general  strategy  makes  sense  in 
some  cases.  Suppose  that  the  system  can  be  made  to  operate  satisfactorily 
with  the  steps  above  or  perhaps  just  with  the  first  step.  Suppose  also  that  the 
system  can  continue  to  accumulate  examples  while  operating.  An  example 
could  be  an  autonomous  vehicle  that  can  improve,  say,  the  model  of  its 
dynamics  by  collecting  appropriate  example  pairs  while  operating.  Then  it 
makes  sense  to  perform  dimensionality  reduction  and  to  move  the  centers  as 
outlined  above.  As  an  additional  step  one  may  try  to  eliminate  features  that 
receive  little  weight,  if  possible,  and  then  to  add  other  features  while  keeping 
the  previously  found  centers.  This  is  equivalent  to  adding  centers  of  higher 
dimensionality.  Another  iteration  of  moving  centers,  finding  norm  weights, 
eliminating  features  and  centers  then  takes  place. 
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Experiments  with  movable  centres  and  movable  weights  have  been  per¬ 
formed  in  the  context  of  object  recognition  (Poggio  and  Edelman,  1990;  Edel- 
man  and  Poggio,  1990)  and  approximation  of  multivariate  functions  (Caprile, 
Girosi  and  Poggio,  1990)  and  in  both  cases  the  results  are  promising. 

6  Remarks 

1.  Equation  (19)  is  similar  to  an  operation  of  (task-dependent)  dimension¬ 
ality  reduction  (Duda  and  Hart,  1973)  whereas  equation  (17)  is  similar 
to  a  clustering  process. 

2.  It  is  conceivable  that  learning  the  weights  of  the  norm  is  even  more 
important  than  learning  the  centers  and  that  in  many  cases  it  may  be 
preferable  to  set  the  centers  to  a  representative  subset  of  the  data  and 
to  keep  them  fixed  thereafter. 

3.  A  specific  matrix  W  corresponds  to  a  specific  metric  in  the  multidi¬ 
mensional  input  space:  W  projects  the  input  vector  into  the  subspace 
spanned  by  its  rows.  In  the  case  of  the  rows  of  W  spanning  the  space 
orthogonal  to  the  principal  components  of  the  inputs,  W  assigns  a 
metric  ellipsoid  with  the  largest  axes  (corresponding  to  a  large  cr  in 
the  Gaussian)  along  the  principal  components  and  the  small  axis  (cor¬ 
responding  to  a  small  a  in  the  Gaussian)  orthogonal  to  it:  thus  even 
vectors  that  are  far  away  (in  the  ordinary  euclidean  metric)  are  close  in 
this  metric  if  they  lie  in  the  hyperplane  of  the  principal  components  and 
even  close  vectors  (in  the  ordinary  metric)  are  far  away  in  the  metric 
induced  by  W  if  they  are  orthogonal  to  the  principal  components. 

4.  In  the  case  of  N  examples,  n  =  N  fixed  centers  and  M  =  I,  there  are 
enough  data  to  constrain  the  N  coefficients  ca  to  be  found.  Moving 
centers  add  another  nd  parameters  ( d  is  the  number  of  input  compo¬ 
nents)  and  the  matrix  M  another  independent  parameters.  Thus 
the  number  of  examples  N  must  be  sufficiently  large  to  constrain  ad¬ 
equately  the  free  parameters  -  n  d-dimensional  centers,  n  coefficients 
ca  and  independent  entries  of  the  matrix  M.  Thus 
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.  <p  +  d 

N  >>  n  +  nd  - - — . 

it 


5.  In  the  case  of  Gaussian  basis  functions,  learning  the  entries  of  a  diagonal 
W  is  equivalent  to  learning  the  variances  of  each  two-dimensional  (or 
one-dimensional)  Gaussian  receptive  field  for  each  center.  It  is  clear 
that  sets  of  units  with  different  scales  (see  section  3.1)  correspond  to 
sets  of  units  with  different  W. 
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